```{r}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
getwd()
```
```{r}
spruce.df = read.csv("SPRUCE.csv")
tail(spruce.df)
```
```{r}
library(s20x)
trendscatter(Height ~ BHDiameter, f = 0.5, data = spruce.df)
```
```{r}
spruce.lm <- with(spruce.df, lm(Height ~ BHDiameter))
height.res <- resid(spruce.lm)
height.fit <- fitted(spruce.lm)
```
```{r}
# Residuals vs. fitted values (fitted on the x-axis, matching trendscatter below)
plot(height.fit, height.res)
trendscatter(height.fit, height.res)
```
The first trendscatter (Height vs. BHDiameter) follows a roughly straight trend, while the second (residuals vs. fitted values) shows a clearly quadratic shape.
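To back up that visual impression, here is a quick check I added (not part of the original lab): regress the residuals on the fitted values with a quadratic term and see whether the squared term is significant.

```{r}
# Added check: a significant I(height.fit^2) coefficient supports the
# visually quadratic trend in the residuals.
curv.lm <- lm(height.res ~ height.fit + I(height.fit^2))
summary(curv.lm)$coefficients
```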
```{r}
plot(spruce.lm, which = 1)
normcheck(spruce.lm, shapiro.wilk = TRUE)
```
The p-value is 0.29. Since the p-value is greater than .05, we cannot reject the null hypothesis that the residuals are normally distributed.
When we fit a straight line to this data set, we are assuming that the errors are normally distributed. Here the Shapiro-Wilk p-value is greater than .05, so there is no evidence against that assumption. Therefore, I conclude that we can apply a straight line to the data set with acceptable confidence that the model is valid.
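As a cross-check on normcheck() (my addition), base R's shapiro.test() can be run directly on the residuals and should agree with the p-value reported above:

```{r}
# Shapiro-Wilk test applied directly to the straight-line model's residuals
shapiro.test(resid(spruce.lm))
```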
```{r}
# Make a quadratic model
quad.lm = lm(Height ~ BHDiameter + I(BHDiameter^2), data = spruce.df)
# Find the coefficients
coef(quad.lm)
# Make a function that produces heights for inputs "x"
myplot = function(x){
  0.86089580 + 1.46959217*x - 0.02745726*x^2
}
plot(Height ~ BHDiameter, data = spruce.df)
# add the quadratic to the points
curve(myplot, lwd = 2, col = "steelblue", add = TRUE)
quad.fit = fitted(quad.lm)
plot(quad.lm, which = 1)
normcheck(quad.lm, shapiro.wilk = TRUE)
```
The p-value is .684. Since it is greater than .05, we cannot reject normality of the residuals, so we can apply the quadratic curve with acceptable confidence that the model's assumptions hold.
```{r}
summary(quad.lm)
```
$\hat{\beta}_0 = 0.8609$

$\hat{\beta}_1 = 1.4696$

$\hat{\beta}_2 = -0.0275$
```{r}
ciReg(quad.lm)
```
$\hat{y} = 0.8609 + 1.4696x - 0.0275x^2$
```{r}
predict(quad.lm, data.frame(BHDiameter = c(15, 18, 20)))
```
The predicted heights from quad.lm are higher than the corresponding predictions from spruce.lm.
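To make the comparison concrete, here is an added side-by-side of the two models' predictions at the same diameters (assuming spruce.lm from earlier is still in the workspace):

```{r}
# Predictions from the straight-line and quadratic models at the same diameters
new.d <- data.frame(BHDiameter = c(15, 18, 20))
cbind(BHDiameter = new.d$BHDiameter,
      linear     = predict(spruce.lm, new.d),
      quadratic  = predict(quad.lm, new.d))
```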
The multiple R-squared is 0.7741, which is higher than that of the straight-line model (0.6569).
```{r}
quad.lm
```
R-squared is the percentage of the variation in the response (Height) that is explained by the independent variable(s).
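The R-squared values can also be pulled straight out of the model summaries; a minimal sketch:

```{r}
# Multiple R-squared for each fit
summary(spruce.lm)$r.squared  # straight-line model
summary(quad.lm)$r.squared    # quadratic model
```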
```{r}
quad.lm
```
```{r}
anova(spruce.lm)
anova(quad.lm)
summary(quad.lm)
```
quad.lm has a lower residual sum of squares (RSS), so it fits the data more closely than spruce.lm.
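R can also compare the two nested models with a single F-test; this call (my addition) tests whether adding the squared term significantly reduces the RSS:

```{r}
# Nested-model F-test: straight line vs. quadratic
anova(spruce.lm, quad.lm)
```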
```{r}
yhat = quad.fit
RSS = with(spruce.df, sum((Height - yhat)^2))
MSS = with(spruce.df, sum((yhat - mean(Height))^2))
TSS = with(spruce.df, sum((Height - mean(Height))^2))
print(paste0("TSS: ", TSS))
print(paste0("MSS: ", MSS))
print(paste0("RSS: ", RSS))
print(paste0("MSS/TSS: ", MSS/TSS))
```
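As a sanity check (my addition), TSS should decompose exactly into MSS + RSS for a least-squares fit with an intercept, and MSS/TSS should reproduce the multiple R-squared reported by summary():

```{r}
# TSS = MSS + RSS holds for least squares with an intercept
all.equal(TSS, MSS + RSS)
# MSS/TSS is the multiple R-squared
all.equal(MSS/TSS, summary(quad.lm)$r.squared)
```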
```{r}
cooks20x(quad.lm)
```
According to datasciencecentral.com, Cook's distance is used in regression analysis to find outliers that may be negatively affecting the regression model. The measure combines each observation's leverage and residual: the higher the leverage and residual, the higher the Cook's distance.
My research turned up several rules of thumb for deciding when a Cook's distance marks an influential observation: the first flags distances greater than .5, the second flags distances greater than 1, and a third recommends simply inspecting the plot. Under the .5 and 1 cutoffs there would seemingly be no influential observations, but the plot clearly shows one influential observation (24) and two possible ones (18 and 21).
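As a numeric companion to the plot (my addition), base R's cooks.distance() returns the distances themselves, so the candidates can be checked against any cutoff; 4/n is another common rule of thumb:

```{r}
# Cook's distances for the quadratic fit, largest first
cd <- cooks.distance(quad.lm)
head(sort(cd, decreasing = TRUE), 5)
# Observations exceeding the common 4/n rule of thumb
which(cd > 4 / nrow(spruce.df))
```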
```{r}
quad2.lm = lm(Height ~ BHDiameter + I(BHDiameter^2), data = spruce.df[-24, ])
summary(quad2.lm)
summary(quad.lm)
```
Nearly all of the fit metrics are larger for quad2.lm, but not by much. So observation 24 was clearly exerting some influence, but I don't think it was having a serious negative impact on the original model.
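To quantify "not by much" (my addition), the key fit statistics can be placed side by side:

```{r}
# R-squared and residual standard error, with and without observation 24
c(r.sq.with24    = summary(quad.lm)$r.squared,
  r.sq.without24 = summary(quad2.lm)$r.squared)
c(sigma.with24    = summary(quad.lm)$sigma,
  sigma.without24 = summary(quad2.lm)$sigma)
```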
Start with two line equations and a common knot point $x_k$:
$L_1: y = \beta_0 + \beta_1x$
$L_2: y = \beta_0 + \delta + (\beta_1 + \beta_2)x$
Since $x_k$ is common to both lines, we will plug it in:
$y_k = \beta_0 + \beta_1x_k = \beta_0 + \delta + (\beta_1 + \beta_2)x_k$
$\quad \quad \beta_0 + \beta_1x_k = \beta_0 + \delta + \beta_1x_k + \beta_2x_k$
$\quad \quad 0 = \delta + \beta_2 x_k$
$\quad \quad \delta = -\beta_2 x_k$
Now we look at $L_2$:
$L_2 : y = \beta_0 + \delta + (\beta_1 + \beta_2)x$
Replace $\delta$ with what we found earlier:
$L_2 : y = \beta_0 -\beta_2 x_k + (\beta_1 + \beta_2)x$
$\quad\quad y = \beta_0 - \beta_2x_k + \beta_1x + \beta_2x$
$\quad\quad y = \beta_0 + \beta_1x + \beta_2x - \beta_2x_k$
$\quad\quad y = \beta_0 + \beta_1x + \beta_2(x-x_k)$
Notice that this is the equation of $L_1$ with an added term $\beta_2(x-x_k)$.
Now we define the indicator function $I(x>x_k)$, which returns 1 if $x > x_k$ and 0 otherwise.
Thus, we have $y = \beta_0 + \beta_1x + \beta_2(x-x_k)I(x>x_k) \square$
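A quick numeric sanity check of this result (my addition, with arbitrary made-up coefficients): the indicator term vanishes at the knot, so the piecewise curve is continuous there.

```{r}
# Continuity at the knot: the added term is 0 at x = xk,
# so both pieces give the same height there.
b0 <- 1; b1 <- 2; b2 <- -0.5; xk <- 18   # arbitrary illustrative values
f <- function(x) b0 + b1*x + b2*(x - xk)*(x > xk)
f(xk - 1e-8) - f(xk + 1e-8)  # approximately 0
```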
```{r}
coef(spruce.lm)
## piecewise linear model in R
## Model: y = b0 + b1*x + b2*(x - xk)*(x > xk)
# make a new variable X and place it within the same data frame
sp2.df = within(spruce.df, X <- (BHDiameter - 18)*(BHDiameter > 18))
lmp = lm(Height ~ BHDiameter + X, data = sp2.df)
tmp = summary(lmp)
# piecewise prediction function with the knot at 18
myf = function(x, coef){
  coef[1] + coef[2]*x + coef[3]*(x - 18)*(x - 18 > 0)
}
plot(spruce.df, main = "Piecewise regression")
myf(0, coef = tmp$coefficients[, "Estimate"])  # predicted height at x = 0
curve(myf(x, coef = tmp$coefficients[, "Estimate"]), add = TRUE, lwd = 2, col = "Blue")
abline(v = 18)  # mark the knot
text(18, 16, paste("R sq.=", round(tmp$r.squared, 4)))
```
```{r}
library(grac0009MATH4753)
mydata <- grac0009MATH4753::myread("SPRUCE.csv")
head(mydata, 4)
```
The code simply takes a .csv file that is in the current working directory and returns the data as a data frame.
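The source of myread() isn't shown here, but a minimal sketch of what such a wrapper might look like, assuming it simply resolves the file name against the working directory and calls read.csv(), is:

```{r}
# Hypothetical sketch; the real myread() lives in grac0009MATH4753
# and may differ in its details.
myread_sketch <- function(csv) {
  fl <- file.path(getwd(), csv)
  read.csv(fl)
}
```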
```{r}
library(grac0009MATH4753)
spruce.df = grac0009MATH4753::myread("SPRUCE.csv")
quad.lm = lm(Height ~ BHDiameter + I(BHDiameter^2), data = spruce.df)
quad.fit = fitted(quad.lm)
# quadratic prediction function using the fitted coefficients
myplot = function(x){
  0.86089580 + 1.46959217*x - 0.02745726*x^2
}
plot(Height ~ BHDiameter, data = spruce.df,
     main = "Spruce height prediction",
     bg = "blue", pch = 21, cex = 1.1,
     xlim = c(0, max(BHDiameter)), ylim = c(0, max(Height)))
# add the quadratic to the points
curve(myplot, lwd = 2, col = "steelblue", add = TRUE)
# segments from each observation to the fitted curve
with(spruce.df, segments(BHDiameter, Height, BHDiameter, myplot(BHDiameter), col = "Black"))
# highlight the three largest Cook's distance observations in red
lgcooks = spruce.df[c(18, 21, 24), ]
with(lgcooks, segments(BHDiameter, Height, BHDiameter, myplot(BHDiameter), col = "red", lwd = 3))
with(spruce.df, text(Height ~ BHDiameter, labels = row.names(spruce.df), pos = 4, cex = 0.5))
# arrow and label pointing at the highest Cook's distance observation (24)
with(spruce.df, arrows(BHDiameter[18], Height[24], BHDiameter[24], Height[24], col = "Blue", lwd = 2))
with(spruce.df, text(BHDiameter[18], Height[24], labels = "Highest Cook's\ndistance", pos = 2, cex = 1))
```